You have to come up with an initial project idea within 3 weeks. You can start from a disease, a biological question, a method, or something else. Please identify and read a recent publication that interests you, and make sure its dataset is publicly available. I expect you to prepare a 10-minute presentation explaining what the data are about, what you are interested in and why, and what you would like to accomplish.
Within this notebook, there are five problems for you to complete. These problems are written in a blockquote:
Homework Problem Example 1: Make a figure.
Install the main package we’ll be using in this notebook,
sva. The data are prepared and contained in the package
bladderbatch. R packages on CRAN can be installed with
install.packages(), while Bioconductor packages are installed with
BiocManager::install(). Installation can occasionally run into
problems, so if the basic commands below don’t work, please search
for the error message.
options(repos = c(CRAN = "https://cloud.r-project.org/"))
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("sva")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cloud.r-project.org/
## Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.2 (2024-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'sva'
install.packages(c("devtools", "broom"))
##
## The downloaded binary packages are in
## /var/folders/s2/8k6_19794tg76ppn72tt08y00000gn/T//Rtmp501sE2/downloaded_packages
BiocManager::install(c("Biobase", "bladderbatch"))
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
## CRAN: https://cloud.r-project.org/
## Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.2 (2024-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'Biobase' 'bladderbatch'
library(devtools)
library(Biobase)
library(sva)
library(bladderbatch)
library(broom)
library(tidyverse)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
As shown in Alter et al. and Johnson et al., PCA, factor models, and other related methods can be used to identify and correct for batch effects.
Using the Bottomly et al. data from last week, we will compute correlations between some known variables and the top PCs. Load the data:
con = url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
save(bottomly.eset, file="bottomly.Rdata")
load(file="bottomly.Rdata")
ls()
## [1] "bottomly.eset" "con"
edata <- as.matrix(exprs(bottomly.eset))
dim(edata)
## [1] 36536 21
edata[1:5,1:5]
## SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001 369 744 287 769 348
## ENSMUSG00000000003 0 0 0 0 0
## ENSMUSG00000000028 0 1 0 1 1
## ENSMUSG00000000031 0 0 0 0 0
## ENSMUSG00000000037 0 1 1 5 0
sumna <- apply(edata, 1, function(x) sum(is.na(x)))
row.variances <- apply(edata, 1, function(x) var(x))
row.means <- apply(edata, 1, function(x) mean(x))
plot(row.means, row.variances, pch=19, xlab="Row mean", ylab="Row variance", main="Mean vs. variance relationship")
hist(row.means,500)
edata <- edata[rowMeans(edata) > 10, ]
edata <- log2(as.matrix(edata) + 1)
Compute the SVD and visualize the top 2 PCs. Using the metadata, we can color each data point accordingly:
edata <- t(scale(t(edata), scale=FALSE, center=TRUE))
svd.out <- svd(edata)
PC = data.table(svd.out$v,pData(bottomly.eset))
ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(strain)))
ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(lane.number)))
ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(experiment.number)))
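Beyond the scatter plots, it helps to check how much variance each PC actually explains. A minimal sketch, assuming svd.out from the chunk above is in the workspace (the squared singular values, normalized, give the proportion of variance explained):

```r
# proportion of variance explained by each PC:
# squared singular values, normalized to sum to 1
pc.var <- svd.out$d^2 / sum(svd.out$d^2)
head(pc.var)

# scree plot of the top 10 PCs
plot(pc.var[1:10], type = "b", pch = 19,
     xlab = "PC", ylab = "Proportion of variance explained")
```

If the top 2 PCs explain only a modest fraction, correlations with known variables should be interpreted with that in mind.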
Compute correlation between a PC and each of measured variables (strain, lane.number, and experiment.number):
print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,1], method="spearman"))
## [1] -0.5489913
print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,2], method="spearman"))
## [1] 0.5104656
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,1], method="spearman"))
## [1] -0.2280568
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,2], method="spearman"))
## [1] -0.2241248
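Note that strain is a factor, so cor() cannot use it directly. One workaround, a sketch assuming bottomly.eset and svd.out from above (and that the two strain levels sort alphabetically, C57BL/6J before DBA/2J), is to encode the strains numerically before computing the Spearman correlation:

```r
# encode the two-level strain factor as 1/2 (level order is alphabetical)
strain.num <- as.numeric(as.factor(pData(bottomly.eset)$strain))
print(cor(strain.num, svd.out$v[, 1], method = "spearman"))
print(cor(strain.num, svd.out$v[, 2], method = "spearman"))
```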
While this approach was once popular, it has several disadvantages. In particular, we need a large sample size (per batch), and some PCs may capture both batch effects and biological signal. In some cases, many PCs may each be moderately related to technical factors.
We use the microarray gene expression data on 57 bladder samples
that were processed in 5 batches. This dataset is available as an
R/Bioconductor package, bladderbatch, and is well known to be
confounded by batch. Read the original study.
As this is also an ExpressionSet, the steps to extract the expression data and metadata are identical to those from the last 2 weeks.
library(bladderbatch)
data(bladderdata)
# sample info
pheno = pData(bladderEset)
# expression data
edata = exprs(bladderEset)
dim(pheno)
## [1] 57 4
dim(edata)
## [1] 22283 57
#edata <- edata[1:10000,]
edata[1:5,1:10]
## GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1007_s_at 10.115170 8.628044 8.779235 9.248569 10.256841
## 1053_at 5.345168 5.063598 5.113116 5.179410 5.181383
## 117_at 6.348024 6.663625 6.465892 6.116422 5.980457
## 121_at 8.901739 9.439977 9.540738 9.254368 8.798086
## 1255_g_at 3.967672 4.466027 4.144885 4.189338 4.078509
## GSM71024.CEL GSM71025.CEL GSM71026.CEL GSM71028.CEL GSM71029.CEL
## 1007_s_at 10.023133 9.108034 8.735616 9.803271 10.168602
## 1053_at 5.248418 5.252312 5.220931 5.595771 5.025180
## 117_at 5.796155 6.414849 6.846798 5.841478 6.352257
## 121_at 8.002870 9.093704 9.263386 7.789240 9.834564
## 1255_g_at 3.919740 4.402590 4.173666 3.590649 4.338196
It is important to look at the phenotype data to check sequencing lanes, prep dates, and other experimental batches. Then, remove rows with NA, NaN, etc.
head(pheno)
## sample outcome batch cancer
## GSM71019.CEL 1 Normal 3 Normal
## GSM71020.CEL 2 Normal 2 Normal
## GSM71021.CEL 3 Normal 2 Normal
## GSM71022.CEL 4 Normal 3 Normal
## GSM71023.CEL 5 Normal 3 Normal
## GSM71024.CEL 6 Normal 3 Normal
sumna <- apply(edata, 1, function(x) sum(is.na(x)))
row.variances <- apply(edata, 1, function(x) var(x))
row.means <- apply(edata, 1, function(x) mean(x))
plot(row.means, row.variances, pch=19, xlab="Row mean", ylab="Row variance", main="Mean vs. variance relationship")
edata <- edata[row.variances < 6,]
edata.log <- log2(edata)
Scale the rows and apply SVD. For exploration, we proceed with both centered data and centered-and-scaled data:
edata.scaled <- t(scale(t(edata.log), scale=TRUE, center=TRUE))
edata.centered <- t(scale(t(edata.log), scale=FALSE, center=TRUE))
svd.centered.out <- svd(edata.centered)
svd.centered.plot <- data.table(svd.centered.out$v[,1:10], pheno)
svd.scaled.out <- svd(edata.scaled)
svd.scaled.plot <- data.table(svd.scaled.out$v[,1:10], pheno)
Visualize the scatter plots, coloring each sample by its batch or cancer status:
ggplot(svd.centered.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))
ggplot(svd.centered.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(cancer)))
ggplot(svd.scaled.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))
ggplot(svd.scaled.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(cancer)))
Homework Problem 1: Create a table to show the batch effects (refer to Figure 1 in Gilad and Mizrahi-Man, 2015). There are 5 batches (
pheno$batch); how are the biological variables and other variables related to the study design distributed among those 5 batches? Explain what could be a problem. Prepare this into a PDF file. Explanation: The distribution of biological and outcome variables is highly uneven across batches. This can lead to batch effects, confounding, and reduced statistical power.
library(bladderbatch)
library(tidyverse)
library(grid)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
data(bladderdata)
pheno <- pData(bladderEset)
batch_table <- table(pheno$batch, pheno$cancer, pheno$outcome)
batch_df <- as.data.frame.table(batch_table)
colnames(batch_df) <- c("Batch", "Cancer", "Outcome", "Count")
batch_summary <- batch_df %>%
filter(Count > 0) %>% # Only keep combinations that exist
arrange(Batch, Cancer, Outcome)
batch_summary
## Batch Cancer Outcome Count
## 1 1 Cancer mTCC 11
## 2 2 Cancer mTCC 1
## 3 2 Cancer sTCC-CIS 13
## 4 2 Normal Normal 4
## 5 3 Normal Normal 4
## 6 4 Biopsy Biopsy 5
## 7 5 Biopsy Biopsy 4
## 8 5 Cancer sTCC-CIS 3
## 9 5 Cancer sTCC+CIS 12
pivot_table <- batch_df %>%
pivot_wider(
id_cols = c("Cancer", "Outcome"),
names_from = "Batch",
values_from = "Count",
values_fill = 0
)
pivot_table
## # A tibble: 15 × 7
## Cancer Outcome `1` `2` `3` `4` `5`
## <fct> <fct> <int> <int> <int> <int> <int>
## 1 Biopsy Biopsy 0 0 0 5 4
## 2 Cancer Biopsy 0 0 0 0 0
## 3 Normal Biopsy 0 0 0 0 0
## 4 Biopsy mTCC 0 0 0 0 0
## 5 Cancer mTCC 11 1 0 0 0
## 6 Normal mTCC 0 0 0 0 0
## 7 Biopsy Normal 0 0 0 0 0
## 8 Cancer Normal 0 0 0 0 0
## 9 Normal Normal 0 4 4 0 0
## 10 Biopsy sTCC-CIS 0 0 0 0 0
## 11 Cancer sTCC-CIS 0 13 0 0 3
## 12 Normal sTCC-CIS 0 0 0 0 0
## 13 Biopsy sTCC+CIS 0 0 0 0 0
## 14 Cancer sTCC+CIS 0 0 0 0 12
## 15 Normal sTCC+CIS 0 0 0 0 0
# Save PDF
pdf("Swiatkowska_problem1.pdf", width = 12, height = 6)
grid.table(pivot_table)
grid.text("Batch Effects Table", x = 0.5, y = 0.95, just = "top", gp = gpar(fontsize = 14, fontface = "bold"))
dev.off()
## quartz_off_screen
## 2
When technical variables are known, we can add them to a standard linear model. Note that, for convenience, we tell R to use “Normal” as the base factor:
pheno$cancer = relevel(pheno$cancer, ref="Normal")
We will fit this model to one variable, namely the first gene in the gene expression data.
mod = lm(edata[1,] ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
print(mod)
##
## Call:
## lm(formula = edata[1, ] ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
##
## Coefficients:
## (Intercept) as.factor(pheno$cancer)Biopsy
## 8.48001 0.40212
## as.factor(pheno$cancer)Cancer as.factor(pheno$batch)2
## 1.36160 0.33272
## as.factor(pheno$batch)3 as.factor(pheno$batch)4
## 1.43092 0.71854
## as.factor(pheno$batch)5
## -0.03006
You can now fit this linear model to all genes at once. We look at the coefficients related to cancer:
pheno$cancer = relevel(pheno$cancer, ref="Normal")
mod = lm(t(edata) ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
names(mod)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
dim(mod$coefficients)
## [1] 7 22281
rownames(mod$coefficients)
## [1] "(Intercept)" "as.factor(pheno$cancer)Biopsy"
## [3] "as.factor(pheno$cancer)Cancer" "as.factor(pheno$batch)2"
## [5] "as.factor(pheno$batch)3" "as.factor(pheno$batch)4"
## [7] "as.factor(pheno$batch)5"
# the "broom" package cleans up the outputs of lm
# now, we can use ggplot2 to plot various aspects of the fit
library(broom)
mod_tidy <- tidy(mod)
ggplot(mod_tidy) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")
# however, the previous line of code makes a histogram of ALL coefficients.
# what we need are the estimates of a particular regression term.
mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")
## # A tibble: 22,281 × 6
## response term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1007_s_at as.factor(pheno$cancer)Cancer 1.36 0.283 4.82 1.39e-5
## 2 1053_at as.factor(pheno$cancer)Cancer 0.304 0.115 2.65 1.07e-2
## 3 117_at as.factor(pheno$cancer)Cancer -0.232 0.246 -0.943 3.50e-1
## 4 121_at as.factor(pheno$cancer)Cancer -0.814 0.236 -3.45 1.14e-3
## 5 1255_g_at as.factor(pheno$cancer)Cancer -0.331 0.0964 -3.43 1.22e-3
## 6 1294_at as.factor(pheno$cancer)Cancer 0.579 0.238 2.44 1.84e-2
## 7 1316_at as.factor(pheno$cancer)Cancer -0.608 0.132 -4.60 2.95e-5
## 8 1320_at as.factor(pheno$cancer)Cancer -0.242 0.131 -1.85 7.05e-2
## 9 1405_i_at as.factor(pheno$cancer)Cancer -0.0375 0.460 -0.0815 9.35e-1
## 10 1431_at as.factor(pheno$cancer)Cancer -0.270 0.214 -1.26 2.12e-1
## # ℹ 22,271 more rows
ggplot(mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")
# how about the p-values?
ggplot(mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")
Explore the convenient functions filter and
select. filter allows you to choose rows
based on logical conditions on a specific column. With
select, you can choose columns.
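A minimal, self-contained illustration of filter and select on a toy tibble (the column names here are made up for the example):

```r
library(dplyr)

toy <- tibble::tibble(gene     = c("g1", "g2", "g3"),
                      estimate = c(1.2, -0.4, 0.9),
                      p.value  = c(0.001, 0.40, 0.03))

# filter: keep only rows satisfying a logical condition
toy %>% filter(p.value < 0.05)

# select: keep only some columns
toy %>% select(gene, estimate)
```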
ComBat effectively removes the unwanted variation due to known,
specified technical variables. The batch argument expects
only one technical variable. You can also specify a model matrix
(built with model.matrix) that includes further adjustment
variables. ComBat returns the cleaned data, to which you can then fit a
linear model.
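As an example of passing additional covariates via the mod argument, the biological variable of interest can be included in the model matrix so that it is protected during the batch adjustment. A sketch, assuming pheno and edata from this notebook (mod.cancer and combat_edata_adj are names made up for this example):

```r
# protect the biological signal (cancer status) while removing batch
mod.cancer = model.matrix(~as.factor(cancer), data = pheno)
combat_edata_adj = ComBat(dat = edata, batch = pheno$batch,
                          mod = mod.cancer,
                          par.prior = TRUE, prior.plots = FALSE)
```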
library(sva)
batch = pheno$batch
combat_edata = ComBat(dat=edata, batch=pheno$batch, mod=model.matrix(~1, data=pheno), par.prior=TRUE, prior.plots=TRUE)
## Found5batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
Just because we ran an algorithm that is designed to remove batch effects doesn’t necessarily mean that the batch effects were removed. It is necessary to check what has been returned:
class(combat_edata)
## [1] "matrix" "array"
dim(combat_edata)
## [1] 22281 57
combat_edata[1:10,1:10]
## GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1007_s_at 10.086489 8.593022 8.735539 8.904940 10.279648
## 1053_at 5.519209 5.113330 5.158350 5.262211 5.265270
## 117_at 6.561745 6.432690 6.270195 6.220634 6.020381
## 121_at 8.789972 9.223910 9.316876 9.194753 8.670989
## 1255_g_at 3.899324 4.403977 4.092678 4.221117 4.060227
## 1294_at 7.776515 7.019255 7.158184 6.861669 7.922845
## 1316_at 5.649285 5.970757 5.735519 5.654295 5.419309
## 1320_at 4.804676 4.950993 4.934002 5.054114 4.781217
## 1405_i_at 4.610667 5.134103 5.253515 4.949733 5.716855
## 1431_at 3.611855 3.838787 4.032772 3.929808 3.671817
## GSM71024.CEL GSM71025.CEL GSM71026.CEL GSM71028.CEL GSM71029.CEL
## 1007_s_at 9.961004 9.045476 8.694422 9.899767 10.045201
## 1053_at 5.369203 5.284901 5.256371 5.510407 5.078401
## 117_at 5.748936 6.228249 6.583219 5.959973 6.176811
## 121_at 7.758164 8.904424 9.060979 8.070845 9.587972
## 1255_g_at 3.829740 4.342484 4.120578 3.705901 4.280065
## 1294_at 7.981089 7.380566 7.191836 7.400875 7.738215
## 1316_at 4.974139 5.673043 6.034637 5.157917 5.495240
## 1320_at 4.565796 5.050048 4.983493 4.667656 4.648607
## 1405_i_at 5.670019 5.446907 4.823131 5.894286 4.899287
## 1431_at 3.630528 4.016914 4.053489 3.672133 3.910500
## compare heatmaps before vs. after
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(RColorBrewer)
my_palette <- colorRampPalette(c("blue", "white", "darkred"))(n = 299)
edata_sub <- edata[,]
png("bladder.png",height=700,width=700)
heatmap.2(edata,
main = "Bladder Cancer Data Clustered", # heat map title
notecol="black", # change font color of cell labels to black
density.info="none", # turns off density plot inside color legend
trace="none", # turns off trace lines inside the heat map
margins =c(12,9), # widens margins around plot
col=my_palette, # use the color palette defined earlier
dendrogram="none", # draw no dendrograms
scale = "row")
dev.off()
## quartz_off_screen
## 2
png("bladder_combat.png",height=700,width=700)
heatmap.2(combat_edata,
main = "Bladder Cancer Data Cleaned by ComBat", # heat map title
notecol="black", # change font color of cell labels to black
density.info="none", # turns off density plot inside color legend
trace="none", # turns off trace lines inside the heat map
margins =c(12,9), # widens margins around plot
col=my_palette, # use the color palette defined earlier
dendrogram="none", # draw no dendrograms
scale = "row")
dev.off()
## quartz_off_screen
## 2
Evaluate whether the ComBat-cleaned data still shows any relation to the batch variable:
svd.out.combat <- svd(combat_edata)
svd.combat.plot <- data.table(svd.out.combat$v[,1:10], pheno)
ggplot(svd.combat.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))
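A quantitative complement to the plot, assuming svd.out.combat and pheno from above, is to correlate the top PCs of the cleaned data with the batch label, as we did earlier; values near 0 would suggest the batch effect has been removed:

```r
# Spearman correlation between batch and the top 2 PCs of the cleaned data
print(cor(pheno$batch, svd.out.combat$v[, 1], method = "spearman"))
print(cor(pheno$batch, svd.out.combat$v[, 2], method = "spearman"))
```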
Homework Problem 2: Make heatmaps BEFORE and AFTER cleaning the data using ComBat, where columns are arranged according to the study design. You must sort the columns such that the 5 batches are shown. Cluster the rows, but do not cluster the columns (samples) when drawing the heatmap. The general idea is to see whether the ComBat-cleaned data show any improvement in the general patterns.
library(bladderbatch)
library(sva)
library(gplots)
library(RColorBrewer)
data(bladderdata)
pheno <- pData(bladderEset)
edata <- exprs(bladderEset)
combat_edata <- ComBat(dat = edata, batch = pheno$batch, mod = model.matrix(~1, data = pheno), par.prior = TRUE, prior.plots = FALSE)
## Found5batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
# sort columns by batch
batch_order <- order(pheno$batch)
edata_sorted <- edata[, batch_order]
combat_edata_sorted <- combat_edata[, batch_order]
my_palette <- colorRampPalette(c("blue", "white", "red"))(n = 299)
# heatmap for the UNCORRECTED data
png("heatmap_uncorrected.png", width = 700, height = 700)
heatmap.2(
edata_sorted,
Colv = FALSE, # Do not cluster columns
Rowv = TRUE, # Cluster rows
col = my_palette,
scale = "row", # Scale by row (genes)
main = "Heatmap of UNCORRECTED Data",
trace = "none", # Remove trace lines
dendrogram = "none",
labCol = pheno$batch[batch_order], # Label columns by batch
margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen
## 2
# heatmap for the COMBAT-CORRECTED data
png("heatmap_combat_corrected.png", width = 700, height = 700)
heatmap.2(
combat_edata_sorted,
Colv = FALSE, # Do not cluster columns
Rowv = TRUE, # Cluster rows
col = my_palette,
scale = "row", # Scale by row (genes)
main = "Heatmap of ComBat-CORRECTED Data",
trace = "none", # Remove trace lines
dendrogram = "none",
labCol = pheno$batch[batch_order], # Label columns by batch
margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen
## 2
Homework Problem 3: Make heatmaps of Pearson correlation statistics of the samples. For example, see Figures 2 and 3 from Gilad and Mizrahi-Man (2015), F1000Research. First, compute the correlation statistics among columns. Second, create a heatmap using heatmap.2(). Make sure to create or add labels for samples (cancer vs. normal; batch numbers; others).
# Pearson correlation
cor_uncorrected <- cor(edata, method="pearson")
cor_combat <- cor(combat_edata, method="pearson")
samples_labels <- paste("Batch", pheno$batch, "-", pheno$cancer)
png("heatmap_correlation_uncorrected.png", width = 700, height = 700)
heatmap.2(
cor_uncorrected,
Colv = TRUE, # Cluster columns
Rowv = TRUE, # Cluster rows
col = my_palette,
scale = "none", # No scaling (correlation values are already normalized)
main = "Heatmap (UNCORRECTED)",
trace = "none", # Remove trace lines
dendrogram = "both", # Show for rows and columns
labRow = samples_labels, # Label rows with batch and cancer status
labCol = samples_labels, # Label columns with batch and cancer status
margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen
## 2
# Create a heatmap for the COMBAT-CORRECTED correlation matrix
png("heatmap_correlation_combat_corrected.png", width = 700, height = 700)
heatmap.2(
cor_combat,
Colv = TRUE, # Cluster columns
Rowv = TRUE, # Cluster rows
col = my_palette,
scale = "none", # No scaling (correlation values are already normalized)
main = "Heatmap (ComBat-CORRECTED)",
trace = "none", # Remove trace lines
dendrogram = "both", # Show for rows and columns
labRow = samples_labels, # Label rows with batch and cancer status
labCol = samples_labels, # Label columns with batch and cancer status
margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen
## 2
Now we can fit the linear model to the cleaned data.
modcombat = lm(t(combat_edata) ~ as.factor(pheno$cancer))
# library "broom" clean up the outputs of LM
# now, we can use ggplot2 to plot various aspects of LM
library(broom)
modcombat_tidy <- tidy(modcombat)
# histogram of estimates of particular regression terms.
ggplot(modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")
# different way of looking at coefficients from many different models
# the vertical line indicates the zero coefficient.
ggplot(modcombat_tidy, aes(estimate, term)) +
geom_point() +
geom_vline(xintercept = 0)
Compare the empirical Bayes estimates to the conventional linear models. In the scatter plot below, the red line indicates the identity. The blue line is the linear fit to the estimates from the two approaches. What we observe is that the estimates from the ComBat-cleaned data are shrunken towards 0 compared to the estimates from the conventional linear models. This phenomenon is called shrinkage or regularization, and it plays a critical role in high-dimensional data analysis.
lm_genes <- mod_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer") %>%
pull(response)
combat_genes <- modcombat_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer") %>%
pull(response)
# Find common genes between both models
common_genes <- intersect(lm_genes, combat_genes)
# Create a tibble with only the common genes
# Create a tibble with only the common genes
est_compare <- tibble(
  LinearModel = mod_tidy %>%
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>%
    arrange(response) %>% # Ensure same order
    pull(estimate),
  ComBat = modcombat_tidy %>%
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>%
    arrange(response) %>% # Ensure same order
    pull(estimate)
)
ggplot(est_compare, aes(x=LinearModel, y=ComBat)) +
geom_point(col="darkgrey", alpha=.5, size=.5) + geom_abline(intercept=0, slope=1, col="darkred") + geom_smooth(method = "lm", se = TRUE) + theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
Let’s look at the p-values. The majority of variables are still highly significant, although less so than the p-values from the least-squares method.
ggplot(modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")
Homework Problem 4: Apply two different linear models to the Bottomly et al. data. First, using a conventional approach, create a linear model with genetic strain (a biological variable) and experiment number (a technical variable) on the uncorrected gene expression data.
Second, create a linear model with genetic strain on the ComBat-corrected gene expression data. Make a scatter plot of coefficients and a histogram of p-values as done in this notebook. Make sure that you are pulling out the correct coefficients, not any or all coefficients.
library(sva)
library(broom)
library(ggplot2)
library(data.table)
edata <- as.matrix(exprs(bottomly.eset))
pheno <- pData(bottomly.eset)
combat_edata <- ComBat(dat = edata, batch = pheno$experiment.number, mod = model.matrix(~1, data = pheno), par.prior = TRUE, prior.plots = FALSE)
## Found 24229 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
mod_uncorrected <- lm(t(edata) ~ as.factor(pheno$strain) + as.factor(pheno$experiment.number))
mod_uncorrected_tidy <- tidy(mod_uncorrected)
mod_combat <- lm(t(combat_edata) ~ as.factor(pheno$strain))
mod_combat_tidy <- tidy(mod_combat)
unique(mod_combat_tidy$term)
## [1] "(Intercept)" "as.factor(pheno$strain)DBA/2J"
unique(mod_uncorrected_tidy$term)
## [1] "(Intercept)" "as.factor(pheno$strain)DBA/2J"
## [3] "as.factor(pheno$experiment.number)6" "as.factor(pheno$experiment.number)7"
# p-values
ggplot(mod_uncorrected_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J"), aes(x = p.value)) +
geom_histogram(bins = 100, fill = "darkorange") +
theme_bw() +
labs(title = "P-values for Strain (Uncorrected Data)")
## Warning: Removed 22604 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(mod_combat_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J"), aes(x = p.value)) +
geom_histogram(bins = 100, fill = "darkorange") +
theme_bw() +
labs(title = "P-values for Strain (ComBat-Corrected Data)")
## Warning: Removed 22604 rows containing non-finite outside the scale range
## (`stat_bin()`).
est_compare <- tibble(
Uncorrected = mod_uncorrected_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J") %>% pull(estimate),
ComBat = mod_combat_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J") %>% pull(estimate)
)
# Scatter plot of coefficients
ggplot(est_compare, aes(x = Uncorrected, y = ComBat)) +
geom_point(col = "darkgrey", alpha = 0.5, size = 0.5) +
geom_abline(intercept = 0, slope = 1, col = "darkred") +
geom_smooth(method = "lm", se = TRUE) +
theme_bw() +
labs(title = "Scatter Plot of Coefficients (Uncorrected vs. ComBat)",
x = "Uncorrected Coefficients",
y = "ComBat Coefficients")
## `geom_smooth()` using formula = 'y ~ x'
When the technical variables are not known, or there is additional dependence across the noise terms, SVA can be used to estimate and correct for a dependence kernel.
The hyperparameter required for SVA is the number of surrogate
variables, n.sv. Choosing it is challenging, and numerous methods are
available; the sva package provides two methods to estimate the number
of surrogate variables.
library(bladderbatch)
data(bladderdata)
edata <- exprs(bladderEset)
pheno <- pData(bladderEset)
edata <- as.matrix(edata)
mod = model.matrix(~as.factor(cancer), data=pheno)
set.seed(1)
rnorm(1)
## [1] -0.6264538
# permutation procedure from Buja and Eyuboglu 1992
num.sv(edata,mod,method="be")
## [1] 9
# asymptotic approach from Leek 2011 Biometrics.
num.sv(edata,mod,method="leek")
## [1] 2
We will go with the Leek 2011 method, i.e.,
method="leek".
We fit SVA without specifying any known technical variables. Essentially, we are hoping that SVA can recover the batch effects (including the 5 batches that we know about).
mod = model.matrix(~as.factor(cancer),data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva_output = sva(edata, mod, mod0, n.sv=num.sv(edata,mod,method="leek"))
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
Once the SVs are estimated, we proceed to check how they may be related to the known technical variables. See the LM output:
head(sva_output$sv)
## [,1] [,2]
## [1,] -0.027168731 0.003886997
## [2,] -0.006180066 0.198882692
## [3,] 0.077370249 0.197930599
## [4,] -0.001137525 0.059063230
## [5,] -0.020617855 -0.044496006
## [6,] 0.085287640 -0.191534921
# summary shows how the batches are related to SV1 and SV2 separately.
# which SV have more information about pheno$batch?
summary(lm(sva_output$sv ~ pheno$batch))
## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26953 -0.11076 0.00787 0.10399 0.19069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018471 0.038694 -0.477 0.635
## pheno$batch 0.006051 0.011253 0.538 0.593
##
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared: 0.00523, Adjusted R-squared: -0.01286
## F-statistic: 0.2892 on 1 and 55 DF, p-value: 0.5929
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23973 -0.07467 -0.02157 0.08117 0.25629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.121112 0.034157 3.546 0.000808 ***
## pheno$batch -0.039675 0.009933 -3.994 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared: 0.2248, Adjusted R-squared: 0.2108
## F-statistic: 15.95 on 1 and 55 DF, p-value: 0.0001944
The LM output suggests that SV2, but not SV1, is strongly related to the batch effect (i.e., the technical variable).
Let’s make a scatter plot using SV1 and SV2. The data points are colored by their pheno data.
sva_batch <- tibble(SV1=sva_output$sv[,1],
SV2=sva_output$sv[,2],
batch=as.factor(pheno$batch),
cancer=as.factor(pheno$cancer),
outcome=as.factor(pheno$outcome))
ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=batch))
ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=cancer))
ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=outcome))
We further make violin plots of the SV values, stratified by the five batches. If the SV values are differentially distributed among the batches, that may be evidence that SVA is capturing the known technical variables.
sva_batch <- tibble(SV1=sva_output$sv[,1],
SV2=sva_output$sv[,2],
batch=as.factor(pheno$batch))
sva_batch_gather <- gather(sva_batch,"sv","value",-batch)
ggplot(sva_batch_gather) + geom_violin(aes(x=batch,y=value)) + facet_wrap(~ sv, ncol = 1)
ggplot(sva_batch_gather) + geom_violin(aes(x=batch,y=value)) + facet_wrap(~ sv, ncol = 1) + geom_jitter(aes(x=batch,y=value,col=batch))
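To complement the violin plots with a formal test, a sketch assuming sva_output and pheno from above: a Kruskal-Wallis test asks whether each SV's values are differentially distributed across the 5 batches:

```r
# nonparametric test of SV values across batches
print(kruskal.test(sva_output$sv[, 1], as.factor(pheno$batch)))
print(kruskal.test(sva_output$sv[, 2], as.factor(pheno$batch)))
```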
It seems that the surrogate variables contain substantial information about a known technical variable, so we proceed to fit the model.
Note that the following code to visualize the estimates is rather
complex and long. We are using filter to choose rows
(the cancer term) and pull to extract the estimates
(coefficients).
# Add the surrogate variables to the model matrix
modsva = lm(t(edata) ~ as.factor(pheno$cancer) + cbind(sva_output$sv))
modsva_tidy <- tidy(modsva)
# Get the gene identifiers from all three result sets
lm_genes <- mod_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer") %>%
pull(response)
combat_genes <- modcombat_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer") %>%
pull(response)
sva_genes <- modsva_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer") %>%
pull(response)
# Find common genes across all three models
common_genes <- Reduce(intersect, list(lm_genes, combat_genes, sva_genes))
est_compare <- tibble(
LinearModel = mod_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>%
arrange(response) %>% # Ensure same order
pull(estimate),
ComBat = modcombat_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>%
arrange(response) %>% # Ensure same order
pull(estimate),
SVA = modsva_tidy %>%
filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>%
arrange(response) %>% # Ensure same order
pull(estimate)
)
ggplot(est_compare, aes(x=LinearModel, y=SVA)) +
geom_point(col="darkgrey", alpha=.5, size=.5) +
geom_abline(intercept=0, slope=1, col="darkred") +
geom_smooth(method = "lm", se = TRUE) +
theme_bw() +
labs(
title = "Comparison of Linear Model vs. SVA Coefficients",
x = "Linear Model Estimates",
y = "SVA Estimates"
)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(est_compare, aes(x=ComBat, y=SVA)) +
geom_point(col="darkgrey", alpha=.5, size=.5) +
geom_abline(intercept=0, slope=1, col="darkred") +
geom_smooth(method = "lm", se = TRUE) +
theme_bw() +
labs(
title = "Comparison of ComBat vs. SVA Coefficients",
x = "ComBat Estimates",
y = "SVA Estimates"
)
## `geom_smooth()` using formula = 'y ~ x'
Finally, let’s look at the p-values from SVA.
ggplot(modsva_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")
It seems that even though the estimates are shrunk towards zero and the surrogate variables approximate the technical variables well, the p-values as a whole may not change much.
pvalues_linear <- tibble(key = "LinearModel",
value = mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))
pvalues_combat <- tibble(key = "ComBat",
value = modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))
pvalues_sva <- tibble(key = "SVA",
value = modsva_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))
# Combine them with bind_rows
pvalues_gather <- bind_rows(pvalues_linear, pvalues_combat, pvalues_sva)
# Plot
ggplot(pvalues_gather, aes(x=value)) + geom_histogram() + facet_wrap(~key)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# pi0 from the original data ~ 0.26
# pi0 from a combat-cleaned data ~ 0.28
# pi0 from SVA ~ 0.27
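The pi0 values quoted above (the estimated proportion of null genes) follow Storey's estimator: since null p-values are uniform, roughly a fraction (1 - lambda) of them exceed lambda, so pi0_hat = #{p > lambda} / (n * (1 - lambda)). A self-contained sketch on simulated p-values; the 80/20 mixture below is made up for illustration:

```r
set.seed(1)
# Simulated p-values: 80% null (uniform), 20% alternative (concentrated near 0)
pvals <- c(runif(800), rbeta(200, 0.5, 10))
lambda <- 0.5
# Storey's estimator of the null proportion pi0
pi0_hat <- mean(pvals > lambda) / (1 - lambda)
```

With this mixture pi0_hat lands near the true value of 0.8; in practice the qvalue package automates the choice of lambda.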
Homework Problem 5: Apply ComBat and SVA to the Bottomly et al. data. Make scatter plots of the coefficients and histograms of the p-values, comparing results based on ComBat and SVA. Assume that the biological variable in the Bottomly et al. data is the genetic strain. Make sure that you are pulling out the correct coefficients/p-values, not any or all of them.
library(sva)
library(Biobase)
library(broom)
library(dplyr)
library(ggplot2)
library(tidyr)
edata <- as.matrix(exprs(bottomly.eset))
pheno <- pData(bottomly.eset)
# Filter low expressed genes
edata <- edata[rowMeans(edata) > 10, ]
edata <- log2(edata + 1)
# Remove any rows with NA values
na_rows <- apply(edata, 1, function(x) any(is.na(x)))
edata <- edata[!na_rows, ]
# ComBat correction
mod_for_combat <- model.matrix(~1, data=pheno)
combat_edata <- ComBat(dat=edata, batch=pheno$experiment.number, mod=mod_for_combat, par.prior=TRUE, prior.plots=FALSE)
## Found 3 batches
## Adjusting for 0 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
# model matrices for SVA
mod <- model.matrix(~as.factor(pheno$strain), data=pheno)
mod0 <- model.matrix(~1, data=pheno)
sva_output <- sva(edata, mod, mod0, n.sv=num.sv(edata, mod, method="leek"))
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
# Linear models
mod_combat <- lm(t(combat_edata) ~ as.factor(pheno$strain))
mod_combat_tidy <- tidy(mod_combat)
mod_sva <- lm(t(edata) ~ as.factor(pheno$strain) + sva_output$sv)
mod_sva_tidy <- tidy(mod_sva)
# coefficients and p-values
combat_coef <- mod_combat_tidy %>%
filter(term == "as.factor(pheno$strain)DBA/2J") %>%
pull(estimate)
combat_pvals <- mod_combat_tidy %>%
filter(term == "as.factor(pheno$strain)DBA/2J") %>%
pull(p.value)
sva_coef <- mod_sva_tidy %>%
filter(term == "as.factor(pheno$strain)DBA/2J") %>%
pull(estimate)
sva_pvals <- mod_sva_tidy %>%
filter(term == "as.factor(pheno$strain)DBA/2J") %>%
pull(p.value)
# data frames for plotting
coef_compare <- data.frame(ComBat = combat_coef, SVA = sva_coef)
coef_compare <- na.omit(coef_compare) # na.omit() handles potential NAs in results after model fitting
combat_pvals_df <- data.frame(p.value = combat_pvals)
sva_pvals_df <- data.frame(p.value = sva_pvals)
# Plot coefficients
ggplot(coef_compare, aes(x = ComBat, y = SVA)) +
geom_point(col = "darkgrey", alpha = 0.5, size = 0.5) +
geom_abline(intercept = 0, slope = 1, col = "darkred") +
geom_smooth(method = "lm", se = TRUE) +
theme_bw() +
labs(title = "Coefficients: ComBat vs. SVA",
x = "ComBat Coefficients",
y = "SVA Coefficients")
## `geom_smooth()` using formula = 'y ~ x'
# Plot p-values
ggplot(combat_pvals_df, aes(x = p.value)) +
geom_histogram(bins = 100, fill = "darkorange") +
theme_bw() +
labs(title = "P-value Distribution: ComBat")
ggplot(sva_pvals_df, aes(x = p.value)) +
geom_histogram(bins = 100, fill = "darkorange") +
theme_bw() +
labs(title = "P-value Distribution: SVA")